101 research outputs found
Exploiting Record Similarity for Practical Vertical Federated Learning
As the privacy of machine learning has drawn increasing attention, federated
learning has been introduced to enable collaborative learning without revealing
raw data. Notably, vertical federated learning (VFL), where parties share
the same set of samples but only hold partial features, has a wide range of
real-world applications. However, existing VFL studies rarely address the
"record linkage" process. They either design algorithms assuming the data
from different parties have already been linked or use simple linkage methods
such as exact linkage or top-1 linkage. These approaches are unsuitable for many
applications, such as those involving GPS locations or noisy titles, which
require fuzzy matching. In this paper, we design a novel similarity-based VFL framework,
FedSim, which is suitable for more real-world applications and achieves higher
performance on traditional VFL tasks. Moreover, we theoretically analyze the
privacy risk caused by sharing similarities. Our experiments on three synthetic
datasets and five real-world datasets with various similarity metrics show that
FedSim consistently outperforms other state-of-the-art baselines.
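As a rough illustration of the fuzzy linkage that the abstract above contrasts with exact or top-1 linkage, the sketch below links each record of one party to its k closest records in another party, using Euclidean distance on a shared key such as GPS coordinates. The function name `top_k_link`, the mapping from distance to similarity, and the toy data are assumptions made for illustration; this is not the FedSim algorithm.

```python
import math

def top_k_link(keys_a, keys_b, k=2):
    """For each key in keys_a, return the k closest keys in keys_b.

    Each link is a tuple (index_in_b, similarity), where the similarity
    1 / (1 + distance) is an assumed, illustrative choice.
    """
    links = []
    for a in keys_a:
        # Sort candidate records of party B by distance to record a.
        scored = sorted((math.dist(a, b), j) for j, b in enumerate(keys_b))
        links.append([(j, 1.0 / (1.0 + d)) for d, j in scored[:k]])
    return links

# Hypothetical GPS-like keys held by two parties.
party_a = [[0.0, 0.0], [5.0, 5.0]]
party_b = [[0.1, 0.0], [5.0, 4.0], [9.0, 9.0]]
links = top_k_link(party_a, party_b, k=2)
```

Keeping several candidates per record, together with their similarities, lets a downstream model weigh uncertain matches rather than committing to a single link.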
Privacy-Preserving Gradient Boosting Decision Trees
The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning
model for various tasks in recent years. In this paper, we study how to improve
the model accuracy of GBDT while preserving the strong guarantee of differential
privacy. Sensitivity and privacy budget are two key design aspects for the
effectiveness of differentially private models. Existing solutions for GBDT with
differential privacy suffer from significant accuracy loss due to overly loose
sensitivity bounds and ineffective privacy budget allocations (especially
across different trees in the GBDT model). Loose sensitivity bounds require
more noise to reach a fixed privacy level. Ineffective privacy budget
allocations worsen the accuracy loss, especially when the number of trees is
large. Therefore, we propose a new GBDT training algorithm that achieves
tighter sensitivity bounds and more effective noise allocations. Specifically,
by investigating the properties of gradients and the contribution of each tree in
GBDTs, we propose to adaptively control the gradients of the training data in each
iteration and to apply leaf node clipping in order to tighten the sensitivity bounds.
Furthermore, we design a novel boosting framework to allocate the privacy
budget between trees so that the accuracy loss can be further reduced. Our
experiments show that our approach can achieve much better model accuracy than
other baselines.
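The link between bounded gradients and noise can be sketched with the standard Laplace mechanism: if every gradient is clipped to [-bound, bound], adding or removing one record changes the gradient sum by at most `bound`, so Laplace noise with scale `bound / epsilon` suffices for that sum. The helper names, the fixed (non-adaptive) bound, and the toy gradients below are assumptions for illustration; the paper's adaptive gradient control and budget allocation are not reproduced here.

```python
import random

def clip_gradients(grads, bound):
    """Clip each gradient to the interval [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grads]

def noisy_gradient_sum(grads, bound, epsilon, rng):
    """Laplace mechanism applied to the sum of clipped gradients."""
    clipped = clip_gradients(grads, bound)
    # Adding or removing one record moves the sum by at most `bound`.
    sensitivity = bound
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(clipped) + noise

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
grads = [-3.0, 0.5, 2.0]  # hypothetical per-record gradients
clipped = clip_gradients(grads, bound=1.0)
noisy = noisy_gradient_sum(grads, bound=1.0, epsilon=1.0, rng=rng)
```

A smaller bound means less noise for the same epsilon, which is the sense in which tighter sensitivity bounds help accuracy.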
OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams
Obtaining timely insights from relational data streams is an active
research topic. This type of data stream can present unique challenges, such as
distribution drifts, outliers, emerging classes, and changing features, which
have recently been described as open environment challenges for machine
learning. Although incremental learning for data streams has been studied,
existing evaluations are mostly conducted on manually partitioned
datasets. Thus, a natural question is what those open environment challenges
look like in real-world relational data streams and how existing incremental
learning algorithms perform on real datasets. To fill this gap, we develop an
Open Environment Benchmark named OEBench to evaluate open environment
challenges in relational data streams. Specifically, we investigate 55
real-world relational data streams and establish that open environment
scenarios are indeed widespread in real-world datasets, which presents
significant challenges for stream learning algorithms. Through benchmarks with
existing incremental learning algorithms, we find that increased data quantity
may not consistently enhance the model accuracy when applied in open
environment scenarios, where machine learning models can be significantly
compromised by missing values, distribution shifts, or anomalies in real-world
data streams. Current techniques are insufficient to mitigate
the challenges posed by open environments. More research is needed to
address real-world open environment challenges. All datasets and code are
open-sourced at https://github.com/sjtudyq/OEBench
Effective and Efficient Federated Tree Learning on Hybrid Data
Federated learning has emerged as a promising distributed learning paradigm
that facilitates collaborative learning among multiple parties without
transferring raw data. However, most existing federated learning studies focus
on either horizontal or vertical data settings, where the data of different
parties are assumed to be from the same feature or sample space. In practice, a
common scenario is the hybrid data setting, where data from different parties
may differ both in the features and samples. To address this, we propose
HybridTree, a novel federated learning approach that enables federated tree
learning on hybrid data. We observe the existence of consistent split rules in
trees. With the help of these split rules, we theoretically show that the
knowledge of parties can be incorporated into the lower layers of a tree. Based
on our theoretical analysis, we propose a layer-level solution that does not
require frequent communication to train a tree. Our experiments
demonstrate that HybridTree can achieve comparable accuracy to the centralized
setting with low computational and communication overhead. HybridTree can
achieve a speedup of up to 8 times over the other baselines.
Simulation of upper tropospheric CO₂ from chemistry and transport models
The California Institute of Technology/Jet Propulsion Laboratory two-dimensional (2-D), three-dimensional (3-D) GEOS-Chem, and 3-D MOZART-2 chemistry and transport models (CTMs), driven respectively by NCEP2, GEOS-4, and NCEP1 reanalysis data, have been used to simulate upper tropospheric CO2 from 2000 to 2004. Model results of CO2 mixing ratios agree well with monthly mean aircraft observations at altitudes between 8 and 13 km (Matsueda et al., 2002) in the tropics. The upper tropospheric CO2 seasonal cycle phases are well captured by the CTMs. Model results have smaller seasonal cycle amplitudes in the Southern Hemisphere compared with those in the Northern Hemisphere, which is consistent with the aircraft data. Some discrepancies are evident between the model and aircraft data in the midlatitudes, where models tend to underestimate the amplitude of the CO2 seasonal cycle. Comparison of the simulated vertical profiles of CO2 between the different models reveals that the convection in the 3-D models is likely too weak in boreal winter and spring. Model sensitivity studies suggest that convection mass flux is important for the correct simulation of upper tropospheric CO2.
CO_2 semiannual oscillation in the middle troposphere and at the surface
Using in situ measurements, we find a semiannual oscillation (SAO) in the midtropospheric and surface CO_2. Chemistry transport models (2-D Caltech/JPL model, 3-D GEOS-Chem, and 3-D MOZART-2) are used to investigate possible sources for the SAO signal in the midtropospheric and surface CO_2. From model sensitivity studies, it is revealed that the SAO signal in the midtropospheric CO_2 originates mainly from surface CO_2 with a small contribution from transport fields. It is also found that the source for the SAO signal in surface CO_2 is mostly related to the CO_2 exchange between the biosphere and the atmosphere. By comparing model CO_2 with in situ CO_2 measurements at the surface, we find that models are able to capture both annual and semiannual cycles well at the surface. Model simulations of the annual and semiannual cycles of CO_2 in the tropical middle troposphere agree reasonably well with aircraft measurements.
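One simple way to separate annual and semiannual cycles in an evenly sampled monthly series is to project the mean-removed data onto sinusoids with 12-month and 6-month periods. The sketch below does this for a synthetic series; the function name `harmonic_amplitude`, the synthetic amplitudes, and the assumption of a detrended record spanning whole years are all illustrative and are not taken from the study above.

```python
import cmath
import math

def harmonic_amplitude(series, period):
    """Amplitude of the sinusoid with the given period (in samples).

    Assumes evenly spaced samples covering a whole number of periods
    and a series with no long-term trend.
    """
    n = len(series)
    mean = sum(series) / n
    # Project the mean-removed series onto a complex exponential.
    coeff = sum((x - mean) * cmath.exp(-2j * math.pi * t / period)
                for t, x in enumerate(series))
    return 2.0 * abs(coeff) / n

# Hypothetical monthly series (ppm): offset plus annual and semiannual cosines.
co2 = [380.0 + 2.0 * math.cos(2 * math.pi * t / 12)
       + 0.5 * math.cos(2 * math.pi * t / 6) for t in range(48)]
annual = harmonic_amplitude(co2, 12)
semiannual = harmonic_amplitude(co2, 6)
```

For this synthetic input the recovered amplitudes are close to 2.0 and 0.5 ppm. Real CO_2 records carry a growth trend that would need to be removed first.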
Satellite remote sounding of mid-tropospheric CO_2
Human activity has increased the concentration of the Earth's atmospheric carbon dioxide, which plays a direct role in contributing to global warming. Mid-tropospheric CO_2 retrieved by the Atmospheric Infrared Sounder shows a substantial spatiotemporal variability that is supported by in situ aircraft measurements. The distribution of middle tropospheric CO_2 is strongly influenced by surface sources and large-scale circulations such as the mid-latitude jet streams and by synoptic weather systems, most notably in the summer hemisphere. In addition, the effects of stratosphere-troposphere exchange are observed during a final stratospheric warming event. The results provide the means to understand the sources and sinks and the lifting of CO_2 from surface layers into the free troposphere and its subsequent transport around the globe. These processes are not adequately represented in three chemistry-transport models that have been used to study carbon budgets.